Heterogeneity in a relationship between two variables means that their relationship “varies” with respect to some other variable(s). For our purposes, this can mean that the relationship varies in time (temporal heterogeneity) or it varies across locations (spatial heterogeneity).
Aaron first discovered heterogeneity in the doctor visits indicator, which inspired this DAP. The idea of sensorization, which we apply in this DAP to attempt to account for heterogeneity, has a long history in Delphi; see David’s thesis and Maria’s paper.
In this notebook, we perform a deep investigation of one particular sensorization procedure: a moving window regression, where we fit using data from 10 weeks back to 1 week back. We find that dynamic sensorization is able to correct spatial heterogeneity (improve geo-wise correlations), with neutral effects on time-wise correlations and on the ability to forecast future cases/deaths. Validation setups suggest that we may be overfitting to our targets. I would be far more confident in the geo-wise correlation results if we had some measures of spread (e.g., bootstrap confidence intervals).
In the sensorization procedure, we have two types of variables: indicators and targets. Indicators are the variables that we wish to sensorize, and targets are the variables against which the indicators are sensorized. The indicators considered in this report are the smoothed versions of: (% COVID-19) Doctor Visits; Facebook %CLI; Facebook %CLI-in-Community; (% COVID-19) Hospitalizations from Claims. The targets considered in this report are the smoothed versions of Cases per 100,000 population and Deaths per 100,000 population. All analyses are done at the county level, except for the “ground truth” hospitalizations validation.
For the county level analysis, we only consider counties with at least 500 cumulative cases by November 1st, 2020. For the state level analysis, we consider all fifty US States and the District of Columbia.
This is measured through the “simple forecaster” setup that Ryan has built.
First, we fix notation. Assume an indicator and target (e.g., doctor visits and case rate), which we suppress notationally for brevity. Each observation is then represented as \((x_{t\ell}, y_{t\ell})\), where \(x\) is the indicator value, \(y\) is the target value, \(t\) represents time (measured in dates), and \(\ell\) represents location. Let \(L\) denote the set of all valid locations, e.g., all counties. Let \(x_{t\cdot}\) denote all the \(x_{t\ell}\) collected across locations in \(L\), and similarly for \(y_{t\cdot}\). Let \(x_{t_1:t_2, \ell}\), \(y_{t_1:t_2, \ell}\) denote the observations that fall between times \(t_1\) and \(t_2\), endpoints included. Finally, let \(x, y\) be the collection of all observations across time and location.
This is just the unsensorized indicator itself.
In the basic, spatial-only form of sensorization (as computed in Aaron’s notebook), we ignore the possibility of temporal heterogeneity and learn a single linear relationship between the indicator and target for each location. Specifically, we learn, for each \(\ell \in L\) \[ y_{\cdot\ell} \sim x_{\cdot\ell} \qquad\Rightarrow \mathrm{Model}(\ell) \] and obtain the sensorized indicator values \[ \tilde x_{t\ell} = \texttt{predict}(x_{t\ell}, \mathrm{Model}(\ell)) = \hat y_{t\ell}. \] Although static sensorization was originally computed merely as a point of comparison, we are now seriously considering some form of it to avoid overfitting; more details are given in the sequel. It is important to note that although static sensorization does not allow the model to vary with time, the version considered in this notebook has the important advantage that sensorized estimates are produced in-sample.
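As a concrete sketch (toy data; `sensorize_static` and the variable names are illustrative, not the notebook’s actual helpers), static sensorization in base R looks like:

```r
# Static sensorization: one linear model per location, fit on that location's
# entire history, with predictions made in-sample.
set.seed(1)
dates <- seq(as.Date("2020-06-01"), by = "day", length.out = 100)
locs  <- c("42003", "42101")  # hypothetical county FIPS codes
toy   <- expand.grid(time_value = dates, geo_value = locs,
                     stringsAsFactors = FALSE)
toy$x <- rnorm(nrow(toy), mean = 5)              # indicator
toy$y <- 2 * toy$x + rnorm(nrow(toy), sd = 0.5)  # target

sensorize_static <- function(df) {
  out <- lapply(split(df, df$geo_value), function(d) {
    model <- lm(y ~ x, data = d)      # Model(l)
    d$x_tilde <- predict(model, d)    # sensorized values = fitted y-hat
    d
  })
  do.call(rbind, out)
}

toy_sensorized <- sensorize_static(toy)
```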
Let \(k_1, k_2\) denote the number of days into the past we wish to examine data when fitting our model. (Smaller \(k_1-k_2\) is “more adaptive” to temporal heterogeneity, but may also lead to less stable fits).
We fit, for each \(t, \ell\): \[ y_{(t-k_1):(t-k_2), \ell} \sim x_{(t-k_1):(t-k_2), \ell} \qquad\Rightarrow \mathrm{Model}(\ell, t, k_1, k_2) \] and obtain the sensorized indicator values \[ \tilde x_{t\ell} = \texttt{predict}(x_{t\ell}, \mathrm{Model}(\ell, t, k_1, k_2)). \] We fit a linear model for each location and day, and then take the prediction for that day.
For our purposes we take \(k_1, k_2\) to be \(70, 8\). That is, we train over a nine-week window, without the immediately preceding week of data.
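A minimal base-R sketch of this moving-window fit for a single location (synthetic series; variable names are illustrative):

```r
# Dynamic sensorization for one location: for each day t, fit y ~ x on the
# window [t - k1, t - k2] and predict at t. Note the one-week "air gap"
# (k2 = 8) between the training window and the prediction date.
set.seed(1)
n <- 150
x <- rnorm(n, mean = 5)              # indicator series for one location
y <- 2 * x + rnorm(n, sd = 0.5)      # target series for the same location
k1 <- 70; k2 <- 8

x_tilde <- rep(NA_real_, n)          # sensorized values (NA until day k1 + 1)
for (t in (k1 + 1):n) {
  idx        <- (t - k1):(t - k2)    # 63-day training window
  model      <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  x_tilde[t] <- predict(model, newdata = data.frame(x = x[t]))
}
```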
Overall, we find that geo-wise correlations improve, and there is a neutral impact on time-wise correlations and on usefulness in simple forecasting.
For each of our sensorized indicators, we compute its geo-wise rank correlation against the target data. Assuming that the target data observes the proper “ordering” for each day, an improvement in the geo-wise rank correlations in the sensorized indicators would mean that for a fixed day, the (sensorized) indicator’s values are (more) comparable across location. This is important, for example, if we want the indicator to be intuitively useful on a map.
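For a single day, the geo-wise rank correlation is simply a Spearman correlation taken across locations; a toy illustration (hypothetical FIPS codes and values):

```r
# Geo-wise rank correlation on one day: compare the ordering of locations
# under the indicator against their ordering under the target.
indicator_today <- c("06037" = 12.1, "17031" = 8.4, "42003" = 5.2, "48201" = 9.9)
target_today    <- c("06037" = 80.0, "17031" = 41.5, "42003" = 22.3, "48201" = 55.1)
geo_cor <- cor(indicator_today, target_today, method = "spearman")
geo_cor  # the two orderings agree exactly here, so the rank correlation is 1
```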
We generally find that the geo-wise correlations are improved by sensorization, both dynamic and static. This is encouraging. It is noteworthy that dynamic sensorization does essentially as well as static sensorization here, since this static sensorization has the added advantage that estimates are being produced in-sample. (Why do we see a temporary dip in the rank correlation during June for any indicator dynamically sensorized against Cases?)
sensorize_time_ranges = list(
c(-70, -8)
)
splot_idx = 1
df_cor_list = vector('list', length(source_names))
for (ind_idx in 1:length(source_names)) {
base_cor_fname = sprintf('../sensorization_scripts/results/02_base_cors_%s_%s_%s_%s.RDS',
geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
df_cor_base = readRDS(base_cor_fname)
sensorize_fname = sprintf('../sensorization_scripts/results/03_sensorize_cors_%s_%s_%s_%s.RDS',
geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
sensorize_val_fname = sprintf('../sensorization_scripts/results/03_sensorize_vals_%s_%s_%s_%s.RDS',
geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
sensorize_cors = readRDS(sensorize_fname)[[splot_idx]]
df_cor = bind_rows(df_cor_base, sensorize_cors)
df_cor$Indicator[df_cor$Indicator == sprintf('Sensorized (TS, %d:%d)',
sensorize_time_ranges[[splot_idx]][1],
sensorize_time_ranges[[splot_idx]][2])
] = 'dynamic'
df_cor$Indicator = factor(df_cor$Indicator,
levels=c('raw', 'static', 'dynamic'))
df_cor$sensor_target = sensor_target_names[ind_idx]
df_cor_list[[ind_idx]] = df_cor
}
df_cor = bind_rows(df_cor_list)
df_cor$sensor_target = factor(
df_cor$sensor_target,
levels=sensor_target_names)
plt = ggplot(
df_cor,
aes(x = time_value,
y = value,
color=Indicator)
) + geom_line(
) + facet_wrap (
vars(sensor_target),
ncol=2,
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
) + labs (
x = 'Date',
y = 'Rank correlation (geo-wise)'
)
print(plt)
For each of our sensorized indicators, we compute time-wise correlations over a moving window of time 6 weeks wide. Specifically, we iterate across time, and on each date and for each location, we take the sensorized values (raw, static, and dynamic) from the preceding 42 days and compute the time-wise correlation with the corresponding target data. We then reduce the location component out of this data by taking the median (and mad) of the correlations for each day.
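A base-R sketch of the trailing-window correlation for one location (synthetic series; the helper name is illustrative):

```r
# Time-wise correlation at day t: Spearman correlation of the trailing
# 42 days of the (sensorized) indicator against the target, per location.
trailing_timewise_cor <- function(x, y, t, window = 42) {
  idx <- (t - window + 1):t
  cor(x[idx], y[idx], method = "spearman")
}

set.seed(1)
x  <- cumsum(rnorm(100))            # indicator for one location
y  <- x + rnorm(100, sd = 0.2)      # a closely related target
ct <- trailing_timewise_cor(x, y, t = 100)
# Across locations, each day's correlations are then reduced via median() and mad().
```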
Previously, Aaron found (communicated on Slack) that dynamic sensorization appeared to degrade the time-wise correlation of Doctor visits against cases. I reproduced this apparent result, but also plotted the spreads of the time-wise correlations. The distributions of the time-wise correlations are overlapping, so I am not as concerned. I believe that the most we can say is that dynamic sensorization does not provide any added advantage in time-wise correlations, compared to static sensorization or the raw indicator.
In the following figure, the rank correlations for the static sensorized estimates should be the same as those for the raw indicator; but small discrepancies in county-level data availability yield very small differences.
timewise_cors_list = vector('list', length(source_names))
for (ind_idx in 1:length(source_names)) {
timewise_cor_fname = sprintf(
'../sensorization_scripts/results/04_timewise_cors_%s_%s_%s_%s_-41_0.RDS',
geo_level, source_names[ind_idx],
signal_names[ind_idx], target_names[ind_idx])
timewise_cor_list = readRDS(timewise_cor_fname)
timewise_cors = timewise_cor_list[[1]]
timewise_cors_summarized = timewise_cor_list[[2]]
min_dynamic_correlate = timewise_cors %>% filter (
sensorization == 'dynamic',
) %>% pull (
correlate_date,
) %>% min
timewise_cors_summarized = timewise_cors_summarized %>% filter (
sensorization %in% c('raw', 'static') |
correlate_date >= min_dynamic_correlate+4,
)
timewise_cors_summarized$sensor_target = sensor_target_names[ind_idx]
timewise_cors_list[[ind_idx]] = timewise_cors_summarized
}
timewise_cors_summarized = bind_rows(timewise_cors_list)
timewise_cors_summarized$sensor_target = factor(
timewise_cors_summarized$sensor_target,
levels = sensor_target_names)
timewise_cors_summarized$sensorization = factor(
timewise_cors_summarized$sensorization,
levels=c('raw', 'static', 'dynamic'))
plt = ggplot(
timewise_cors_summarized,
aes(x=correlate_date,
colour=sensorization),
) + geom_line (
aes(y=med,
linetype='median'),
alpha=0.8,
) + geom_line (
aes(y=med+mad,
linetype='med±mad'),
) + geom_line (
aes(y=med-mad,
linetype='med±mad'),
) + scale_linetype_manual(
values=c("median"="solid",
"med±mad"="dashed"),
# "min/max"="dotted"),
breaks=c('median',
'med±mad'),
# 'min/max'),
) + geom_hline(
yintercept = 0,
linetype = 3,
color = "gray"
) + facet_wrap (
vars(sensor_target),
ncol=2,
scale='free_y',
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
) + labs (
x='Date',
y='Rank correlation (time-wise, over trailing 6 weeks)'
) + ggtitle (
'Timewise correlations'
)
print(plt)
As an additional sanity check, I computed the pointwise differences in time-wise correlation within each location, and then took the median (in essence, we interchange the difference and median operations). My concern was basically that the spreads of the time-wise correlations may be large across locations, but perhaps the dynamic sensorization fares consistently worse than the raw indicator by a small amount.
Fortunately, the variability across locations still covers zero, which is a bit “reassuring”.
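To make the interchange concrete (toy correlation values for three hypothetical locations on one day):

```r
# Instead of median(dynamic) - median(raw), take the per-location differences
# first, then the median of those differences.
raw     <- c(loc_a = 0.50, loc_b = 0.62, loc_c = 0.40)  # time-wise cors, raw
dynamic <- c(loc_a = 0.48, loc_b = 0.65, loc_c = 0.41)  # time-wise cors, dynamic
median(dynamic - raw)  # median of pointwise differences: 0.01
```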
timewise_cors_list = vector('list', length(source_names))
for (ind_idx in 1:length(source_names)) {
timewise_cor_fname = sprintf(
'../sensorization_scripts/results/04_timewise_cors_%s_%s_%s_%s_-41_0.RDS',
geo_level, source_names[ind_idx],
signal_names[ind_idx], target_names[ind_idx])
timewise_cor_list = readRDS(timewise_cor_fname)
timewise_cors = timewise_cor_list[[1]]
min_dynamic_correlate = timewise_cors %>% filter (
sensorization == 'dynamic',
) %>% pull (
correlate_date,
) %>% min
timewise_cors_summarized = timewise_cors %>% filter (
sensorization %in% c('raw') |
correlate_date >= min_dynamic_correlate+4,
) %>% pivot_wider (
names_from = sensorization,
values_from = value,
) %>% mutate (
static = static - raw,
dynamic = dynamic - raw,
raw = raw - raw,
) %>% pivot_longer (
cols = c(raw, static, dynamic),
names_to = 'sensorization',
values_to = 'value',
) %>% group_by (
correlate_date,
sensorization,
) %>% summarize (
med=median(value, na.rm=TRUE),
mad=mad(value, na.rm=TRUE),
min=min(value, na.rm=TRUE),
max=max(value, na.rm=TRUE),
) %>% ungroup
timewise_cors_summarized$sensor_target = sensor_target_names[ind_idx]
timewise_cors_list[[ind_idx]] = timewise_cors_summarized
}
timewise_cors_summarized = bind_rows(timewise_cors_list)
timewise_cors_summarized$sensor_target = factor(
timewise_cors_summarized$sensor_target,
levels = sensor_target_names)
timewise_cors_summarized$sensorization = factor(
sapply(timewise_cors_summarized$sensorization,
function(x) {sprintf('(%s-raw)', x)}),
levels=c('(raw-raw)', '(static-raw)', '(dynamic-raw)'))
plt = ggplot(
timewise_cors_summarized,
aes(x=correlate_date,
colour=sensorization),
) + geom_line (
aes(y=med,
linetype='median'),
) + geom_line (
aes(y=med+mad,
linetype='med±mad'),
) + geom_line (
aes(y=med-mad,
linetype='med±mad'),
) + scale_linetype_manual(
values=c("median"="solid",
"med±mad"="dashed"),
# "min/max"="dotted"),
breaks=c('median',
'med±mad'),
# 'min/max'),
) + facet_wrap (
vars(sensor_target),
ncol=2,
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
) + labs (
x='Date',
y='Change in rank correlation (time-wise, over trailing 6 weeks)'
) + ggtitle (
'Change in timewise correlations (compared to raw indicator)'
)
print(plt)
For each (indicator, target) pair, we perform the static and dynamic sensorizations, and then evaluate its ability to “improve upon” the predictive ability of the target. The setup is the same as in Ryan’s forecasting notebook. We perform 7 day ahead and 14 day ahead forecasts. The base model only uses the target information (cases or deaths), at lags of 0, 7, and 14 days. The other three models add the raw indicator, static sensorized indicator, and dynamic sensorized indicator to the target (all three at the corresponding lags), i.e., six features in each of the latter three models.
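A sketch of the lagged feature construction (toy series; `lag_features` is an illustrative helper, not the forecasting notebook’s actual code):

```r
# Lagged features: at time t, use the series' values at lags 0, 7, 14 days.
# With both the target and an indicator, this yields six features.
lag_features <- function(v, lags = c(0, 7, 14)) {
  sapply(lags, function(l) c(rep(NA, l), head(v, length(v) - l)))
}
target    <- 1:30            # toy target series for one location
indicator <- (1:30) / 2      # toy indicator series
X <- cbind(lag_features(target), lag_features(indicator))  # 30 x 6 design
```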
The four models (for each (indicator, target) pair) are evaluated in terms of scaled error. This normalizes the error by the error of the strawman, which propagates the previous week forward. Scaled errors below 1 are desirable.
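A sketch of the scaled-error computation (synthetic series; here the strawman carries the value from 7 days earlier forward):

```r
# Scaled error: a model's absolute error divided by the error of a strawman
# that propagates the week-ago value forward. Values below 1 beat the strawman.
set.seed(1)
actual   <- 100 + cumsum(rnorm(50, mean = 1))   # steadily rising target
forecast <- actual + rnorm(50, sd = 2)          # a model's 7-day-ahead forecasts
strawman <- c(rep(NA, 7), head(actual, -7))     # propagate week-ago value

scaled_err    <- abs(forecast - actual) / abs(strawman - actual)
median_scaled <- median(scaled_err, na.rm = TRUE)
```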
We find, on the whole, that including the sensorized indicator neither helps nor hurts in the prediction task. This is discouraging for us; we might hope that sensorization, when performed properly, would provide “additional information” beyond the target. We display the median scaled errors over time (the ±mad bands extend beyond the y-axis range), followed by a table of the median scaled errors over all time. Although there are minute differences between the different sensorization techniques, neither static nor dynamic reigns supreme.
As an aside, it is interesting to note that the scaled errors get smaller, and the spread of their distributions narrower, in November. I suspect this is because we have seen a continual surge in cases since October, and the consistent upward trajectory is harder for the strawman to capture (though scaled error spikes in late November, presumably due to Thanksgiving reporting irregularities).
model_names = c('Targets',
'Targets+Raw',
'Targets+Static',
'Targets+Dynamic')
lags = 1:2 * -7
leads = 1:2 * 7
plot_df_list = vector('list', length(source_names))
table_list = vector('list', length(source_names))
min_time_value = Inf
max_time_value = -Inf
for (ind_idx in 1:length(source_names)) {
predictive_fname = sprintf(
'../sensorization_scripts/results/05_predictive_full_models_%s_%s_%s_%s.RDS', geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
res = readRDS(predictive_fname)
# Restrict to common period for all 4 models, then calculate the scaled errors
# for each model, that is, the error relative to the strawman's error
res_all4 = res %>%
#res_all4 = res %>% filter(geo_value %in% geo_values) %>%
drop_na() %>% # Restrict to common time
mutate(err1 = err1 / err0, err2 = err2 / err0, # Compute relative error
err3 = err3 / err0, err4 = err4 / err0) %>% # to strawman model
mutate(dif12 = err1 - err2, dif13 = err1 - err3, # Compute differences
dif14 = err1 - err4) %>% # relative to cases model
ungroup() %>%
select(-err0)
# Calculate and print median errors, for all 4 models and each lead
res_err4 = res_all4 %>%
select(-starts_with("dif")) %>%
pivot_longer(names_to = "model", values_to = "err",
cols = -c(geo_value, time_value, lead)) %>%
mutate(lead = factor(lead, labels = paste(leads, "days ahead")),
model = factor(model, labels = model_names))
min_time_value = min(res_err4$time_value, min_time_value)
max_time_value = max(res_err4$time_value, max_time_value)
table_list[[ind_idx]] = res_err4 %>%
select(-starts_with("dif")) %>%
group_by(model, lead) %>%
summarize(err = median(err, na.rm=TRUE),
n = length(unique(time_value))) %>%
arrange(lead) %>% ungroup() %>%
rename("Model" = model, "Median scaled error" = err,
"Target" = lead, "Test days" = n) %>%
mutate(`Ind // Target` = sensor_target_names_short[[ind_idx]])
plot_df = res_err4 %>% group_by(
model, lead, time_value
) %>% summarize(
med = median(err),
mad = mad(err),
min = min(err),
max = max(err),
) %>% ungroup()
plot_df$sensor_target = sensor_target_names_newline[ind_idx]
plot_df_list[[ind_idx]] = plot_df
}
plot_df = bind_rows(plot_df_list)
table_df = bind_rows(table_list)
plt = ggplot(
plot_df,
aes(x=time_value,
colour=model)
) + geom_line (
aes(y=med,
linetype='median'),
) + scale_linetype_manual(
values=c("median"="solid")
) + scale_color_manual(
values = c("black",
'#F8766D',
'#00BA38',
'#619CFF')
) + geom_hline(
yintercept = 1,
linetype = 2,
color = "gray"
) + facet_grid(
cols=vars(lead),
rows=vars(sensor_target),
scales='free',
) + coord_cartesian (
#0.5, 2
#0.5, 1.5
ylim = c(0.60, 1.1)
) + labs(
x = "Date",
y = "Median scaled error (relative to strawman)"
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
)
print(plt)
table_df = table_df %>% mutate (ITT = sprintf('%s, %s',
`Ind // Target`, Target))
table_df$ridx = 1:nrow(table_df)
min_idxs = table_df %>% group_by(
ITT
) %>% slice(
which.min(`Median scaled error`)
) %>% ungroup %>% pull (
ridx
)
knitr::kable(table_df %>% select(-c(`Ind // Target`, Target, ITT, ridx)),
caption = paste("Test period:", min_time_value, "to",
max_time_value),
format = "html", table.attr = "style='width:70%;'") %>%
kableExtra::row_spec(min_idxs, bold=T, color='black',
background='lightgreen') %>%
kableExtra::pack_rows(index=table(forcats::fct_inorder(
table_df[['ITT']])))
| Model | Median scaled error | Test days |
|---|---|---|
| **Doctor visits // Cases, 7 days ahead** | | |
| Targets | 0.9650675 | 207 |
| Targets+Raw | 0.9610658 | 207 |
| Targets+Static | 0.9481376 | 207 |
| Targets+Dynamic | 0.9628810 | 207 |
| **Doctor visits // Cases, 14 days ahead** | | |
| Targets | 0.9626064 | 193 |
| Targets+Raw | 0.9551195 | 193 |
| Targets+Static | 0.9357050 | 193 |
| Targets+Dynamic | 0.9592159 | 193 |
| **Facebook CLI // Cases, 7 days ahead** | | |
| Targets | 0.9499766 | 203 |
| Targets+Raw | 0.9441602 | 203 |
| Targets+Static | 0.9331479 | 203 |
| Targets+Dynamic | 0.9472326 | 203 |
| **Facebook CLI // Cases, 14 days ahead** | | |
| Targets | 0.9577211 | 189 |
| Targets+Raw | 0.9535906 | 189 |
| Targets+Static | 0.9412451 | 189 |
| Targets+Dynamic | 0.9580655 | 189 |
| **Facebook CLI-in-community // Cases, 7 days ahead** | | |
| Targets | 0.9468342 | 194 |
| Targets+Raw | 0.9349286 | 194 |
| Targets+Static | 0.9317961 | 194 |
| Targets+Dynamic | 0.9418860 | 194 |
| **Facebook CLI-in-community // Cases, 14 days ahead** | | |
| Targets | 0.9577329 | 180 |
| Targets+Raw | 0.9450651 | 180 |
| Targets+Static | 0.9349043 | 180 |
| Targets+Dynamic | 0.9619411 | 180 |
| **Hospitalizations // Cases, 7 days ahead** | | |
| Targets | 0.9425200 | 207 |
| Targets+Raw | 0.9378645 | 207 |
| Targets+Static | 0.9348197 | 207 |
| Targets+Dynamic | 0.9396245 | 207 |
| **Hospitalizations // Cases, 14 days ahead** | | |
| Targets | 0.9447452 | 193 |
| Targets+Raw | 0.9462358 | 193 |
| Targets+Static | 0.9488539 | 193 |
| Targets+Dynamic | 0.9522232 | 193 |
| **Hospitalizations // Deaths, 7 days ahead** | | |
| Targets | 0.9668329 | 207 |
| Targets+Raw | 0.9400759 | 207 |
| Targets+Static | 0.9336758 | 207 |
| Targets+Dynamic | 0.9634197 | 207 |
| **Hospitalizations // Deaths, 14 days ahead** | | |
| Targets | 0.9771000 | 193 |
| Targets+Raw | 0.9436540 | 193 |
| Targets+Static | 0.9248299 | 193 |
| Targets+Dynamic | 0.9652567 | 193 |
We extend the above analysis to multiple leads, from 5 days ahead to 20 days ahead. We also split the data into two time ranges over which we compute the medians, owing to Ryan’s observation that different approaches achieve different performance depending on the behavior of cases (e.g., a surge).
Disappointingly, the models with the static sensorized estimates do best, while the ones with the dynamically sensorized estimates tend to do worse than with just the raw indicator (!).
(This is disappointing because the static sensorized estimates are “cheating”, i.e., in sample, and the dynamic sensorized ones are more “realistic” because they are computed in an online fashion). Perhaps this can be viewed in light of the fact that dynamic sensorization tends to degrade time-wise correlations – and this notion of “proper ordering across time” is important for forecasting. (Evidence of overfitting?)
model_names = c('Targets',
'Targets+Raw',
'Targets+Static',
'Targets+Dynamic')
leads = 5:20
table_list = vector('list', length(source_names))
min_time_value = Inf
max_time_value = -Inf
for (ind_idx in 1:length(source_names)) {
predictive_fname = sprintf(
'../sensorization_scripts/results/08_predictive_full_models_%s_%s_%s_%s.RDS', geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
res = readRDS(predictive_fname)
# Restrict to common period for all 4 models, then calculate the scaled errors
# for each model, that is, the error relative to the strawman's error
res_all4 = res %>%
#res_all4 = res %>% filter(geo_value %in% geo_values) %>%
drop_na() %>% # Restrict to common time
mutate(err1 = err1 / err0, err2 = err2 / err0, # Compute relative error
err3 = err3 / err0, err4 = err4 / err0) %>% # to strawman model
mutate(dif12 = err1 - err2, dif13 = err1 - err3, # Compute differences
dif14 = err1 - err4) %>% # relative to cases model
ungroup() %>%
select(-err0)
# Calculate and print median errors, for all 4 models and each lead
res_err4 = res_all4 %>%
select(-starts_with("dif")) %>%
pivot_longer(names_to = "model", values_to = "err",
cols = -c(geo_value, time_value, lead)) %>%
mutate(model = factor(model, labels = model_names))
min_time_value = min(res_err4$time_value, min_time_value)
max_time_value = max(res_err4$time_value, max_time_value)
table_list[[ind_idx]] = res_err4 %>%
select(-starts_with("dif")) %>%
mutate(time_period =
factor(ifelse(time_value < lubridate::ymd('2020-10-15'),
'Before October 15th',
'October 15th onward'),
labels=c('Before October 15th',
'October 15th onward'))) %>%
group_by(model, lead, time_period) %>%
summarize(err = median(err, na.rm=TRUE),
n = length(unique(time_value))) %>%
arrange(lead) %>% ungroup() %>%
rename("Model" = model, "Median scaled error" = err,
"days_ahead" = lead, "Test days" = n) %>%
mutate(`Ind // Target` = sensor_target_names_short[[ind_idx]])
}
table_df = bind_rows(table_list)
make_scaled_err_plot = function(ggp) {
ggp + geom_line (
) + geom_point (
) + geom_hline(
yintercept = 1,
linetype = 2,
color = "gray"
) + facet_wrap (
vars(`Ind // Target`),
ncol=1,
scales='free',
) + scale_color_manual(
values = c("black",
'#F8766D',
'#00BA38',
'#619CFF')
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
)
}
plt_pre = ggplot(
table_df %>% filter (time_period == 'Before October 15th'),
aes(x=days_ahead,
y=`Median scaled error`,
color=Model),
) %>% make_scaled_err_plot (
) + ggtitle (
'Before October 15th'
) + labs (
y = 'Median scaled error'
)
plt_post = ggplot(
table_df %>% filter (time_period == 'October 15th onward'),
aes(x=days_ahead,
y=`Median scaled error`,
color=Model),
) %>% make_scaled_err_plot (
) + ggtitle (
'October 15th onward'
) + labs (
y = 'Median scaled error'
)
gridExtra::grid.arrange(plt_pre, plt_post, ncol=2)
Our sensorization procedure runs the risk of “overfitting”. Up until now, we have evaluated our sensorized indicator by calculating correlations (or making elementary forecasts) against the very target used in the sensorization step. These evaluation metrics can be trivially maximized by replicating the target in the sensorization, e.g., by taking an extremely short time window.
In the sensorization procedure itself, we guard against overfitting by using a wide time window (9 weeks) and by ignoring the immediately preceding week of data (an air gap between training and evaluation data). In the following subsections, we provide two validation assessments.
Our first validation procedure takes advantage of newly available “ground truth” hospitalization data, provided by the Department of Health & Human Services at the State level, with consistent nationwide coverage beginning in mid-July.
The idea is as follows: if the sensorized Hospitalizations indicator has been overfit to Cases (or Deaths), then we would expect it to show weakened correlations against this ground truth hospitalization incidence. (Ideally, we would see improved correlations, as the sensorization should be “improving” the indicator, e.g., by correcting spatial heterogeneity).
We ingest the data through Epidata. Then, we produce four target “indicators” from the data, each on the US State level:
Note that we (1) calculate incidence within each State by normalizing by population and multiplying by 100,000; and (2) shift the time_value of the data by one day, because by default each date reports on the new patients for the previous day.
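The two preprocessing steps above can be sketched in base R (toy counts; the populations below are hypothetical, not actual figures):

```r
# Per-100k incidence and the one-day shift: counts are normalized by state
# population (x 100,000), and each date's report is re-attributed to the
# previous day.
hosp <- data.frame(
  state      = c("pa", "pa", "ma"),
  time_value = as.Date(c("2020-11-01", "2020-11-02", "2020-11-01")),
  new_admits = c(640, 700, 310)
)
pop <- c(pa = 12800000, ma = 6900000)        # hypothetical state populations
hosp$incidence  <- hosp$new_admits / pop[hosp$state] * 1e5
hosp$time_value <- hosp$time_value - 1       # each report covers the prior day
```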
For two sensorized indicators (Hospitalizations, sensorized separately against Cases and Deaths), we compute the geo-wise correlations against the HHS Hospitalization indicators. We find that:
hosp_cor_df = readRDS('../sensorization_scripts/results/07_hhs_cor_df.RDS')
hosp_cor_df = hosp_cor_df %>% filter (
#indicator == 'Hospitalizations',
indicator %in% c('Doctor visits', 'Hospitalizations'),
)
hosp_cor_df$correlate_target = stringr::str_replace(
hosp_cor_df$correlate_target,
': ',
'\n')
plt = ggplot(
hosp_cor_df,
aes(x=time_value,
y=value,
colour=sensorization)
) + geom_line (
) + facet_grid (
rows = vars(correlate_target),
cols = vars(sensor_target),
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
) + labs (
x='Date',
y='Rank correlation (geo-wise)'
)
print(plt)
As a second form of validation, we illustrate the coefficients fitted by the dynamic sensorization method, as a function of time. Examining these coefficients serves as a useful sanity check. In fact, the dynamic sensorization schemes originally considered, which involved short time windows, displayed superb improvements in the geo-wise correlations, but an investigation into the fitted coefficients revealed that those “improvements” were merely the byproduct of replicating cases in the intercept.
The coefficient distributions leave much to be desired. In the perfect world of our “data model”, the intercepts would be zero (e.g., no COVID-related doctor visits would correspond to no cases); and the slopes would carry the information, perhaps varying over time to account for temporal heterogeneity.
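On synthetic data that obeys this idealized data model, a single window fit does recover a near-zero intercept and an informative slope (toy check; the true coefficients here are assumptions of the example):

```r
# Sanity check on synthetic data matching the ideal "data model": the fitted
# intercept should be near zero and the slope should carry the signal.
set.seed(1)
x   <- rnorm(63, mean = 5)            # one 63-day training window of indicator
y   <- 2 * x + rnorm(63, sd = 0.5)    # true intercept 0, true slope 2
fit <- lm(y ~ x)
coef(fit)  # intercept near 0, slope near 2
```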
What we see in our data is quite different. The intercepts tend to be away from zero, while the slopes tend to be near zero (often, the median ± mad interval covers zero), despite our purposeful choice of a wide, nine-week training window to avoid trivially fitting a moving average.
An additional cause for concern is the phenomenon observed from mid-August to mid-September (shaded in blue), in which the intercept values rise and the slope values fall, as though information is “shifting” from the slope terms into the intercept terms. This phenomenon is especially pronounced in the Doctor visits, Facebook %CLI, and Hospitalizations signals when sensorized against Cases.
Independent of the “meaning” of this simultaneous increase in the intercepts and decrease in the slopes, it is especially concerning that it happens for (nearly) all the indicators sensorized against Cases, at the same time. This suggests that the shifts are due to a change in the behavior of Cases, as a simultaneous change in the behavior of the indicators is unlikely. We should be especially mindful of this, i.e., that the “heterogeneity” being picked up by sensorization can be instability in the target data.
df_coef_list = vector('list', length(source_names))
for (ind_idx in 1:length(source_names)) {
sensorize_val_fname = sprintf('../sensorization_scripts/results/03_sensorize_vals_%s_%s_%s_%s.RDS',
geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
sensorize_vals = readRDS(sensorize_val_fname)[[splot_idx]]
df_coef = sensorize_vals %>% select(
geo_value,
time_value,
intercept,
slope,
)
df_coef$sensor_target = sensor_target_names[ind_idx]
df_coef_list[[ind_idx]] = df_coef
}
coef_df = bind_rows(df_coef_list) %>% pivot_longer (
cols = c(slope, intercept),
names_to = 'coefficient',
values_to = 'value',
)
coef_df_summarized = coef_df %>% group_by (
time_value,
sensor_target,
coefficient,
) %>% summarize (
med = median(value, na.rm=TRUE),
mad = mad(value, na.rm=TRUE),
) %>% ungroup
make_coef_plot = function(ggp) {
ggp + geom_rect(
aes(xmin = lubridate::ymd('2020-08-15'),
xmax = lubridate::ymd('2020-09-15'),
ymin = -Inf,
ymax = Inf),
alpha = 0.03,
fill='#E0FFFF',
) + geom_hline(
yintercept = 0,
linetype = 2,
color = "gray"
) + geom_line (
aes(y=med,
linetype='median'),
) + geom_line (
aes(y=med+mad,
linetype='med±mad'),
) + geom_line (
aes(y=med-mad,
linetype='med±mad'),
) + scale_linetype_manual(
values=c("median"="solid",
"med±mad"="dashed",
"min/max"="dotted"),
breaks=c('median',
'med±mad',
'min/max'),
) + facet_wrap (
vars(sensor_target),
ncol=1,
scale='free_y',
) + theme_bw (
) + theme(
axis.title.x=element_blank(),
legend.position = "bottom",
legend.title = element_blank()
)
}
plt_intercept = ggplot(
coef_df_summarized %>% filter (coefficient=='intercept'),
aes(x=time_value),
) %>% make_coef_plot (
) + labs (
y = 'Intercepts'
)
plt_slope = ggplot(
coef_df_summarized %>% filter (coefficient=='slope'),
aes(x=time_value),
) %>% make_coef_plot (
) + labs (
y = 'Slopes'
)
gridExtra::grid.arrange(plt_intercept, plt_slope, ncol=2)
Sensorization, both dynamic and static, has been found to improve the geo-wise correlations of the sensorized indicator against its intended target. Changes in time-wise correlations and in the ability to forecast future cases/deaths are neutral. Validation setups suggest the spatial correction does work, with some small areas for concern. Future directions include uncertainty estimates for the geo-wise correlations.
In the context of the predictive exercise, how do the sensorized indicators compare against each other? We perform the predictive task where we train each model with only the (0, 7, 14 day lagged versions of the) sensorized indicator (without lagged Cases/Deaths as base covariates).
In the pointwise medians, it appears that static sensorization consistently edges out dynamic sensorization. Perhaps dynamic sensorization is overfitting? However, any differences between their performance are overwhelmed by the size of the (pointwise) spread (outside the plot limits). Since the difference in the pointwise medians is apparent, we do not provide the full table of results. Interestingly, we do not see the same spike in scaled error as before; my guess is that this is because Cases/Deaths are omitted from these models, so the reporting irregularities aren’t “seen” by the models.
model_names = c('Targets',
'Raw',
'Static',
'Dynamic')
leads = 1:2 * 7
plot_df_list = vector('list', length(source_names))
for (ind_idx in 1:length(source_names)) {
predictive_fname = sprintf(
'../sensorization_scripts/results/06_predictive_reduced_models_%s_%s_%s_%s.RDS',
geo_level,
source_names[ind_idx], signal_names[ind_idx],
target_names[ind_idx])
res = readRDS(predictive_fname)
# Restrict to common period for all 4 models, then calculate the scaled errors
# for each model, that is, the error relative to the strawman's error
res_all4 = res %>%
#res_all4 = res %>% filter(geo_value %in% geo_values) %>%
drop_na() %>% # Restrict to common time
mutate(err1 = err1 / err0, err2 = err2 / err0, # Compute relative error
err3 = err3 / err0, err4 = err4 / err0) %>% # to strawman model
mutate(dif12 = err1 - err2, dif13 = err1 - err3, # Compute differences
dif14 = err1 - err4) %>% # relative to cases model
ungroup() %>%
select(-err0)
# Calculate and print median errors, for all 4 models and each lead
res_err4 = res_all4 %>%
select(-starts_with("dif")) %>%
pivot_longer(names_to = "model", values_to = "err",
cols = -c(geo_value, time_value, lead)) %>%
mutate(lead = factor(lead, labels = paste(leads, "days ahead")),
model = factor(model, labels = model_names))
plot_df = res_err4 %>% group_by(
model, lead, time_value
) %>% summarize(
med = median(err),
mad = mad(err),
min = min(err),
max = max(err),
) %>% ungroup()
plot_df$sensor_target = sensor_target_names_newline[ind_idx]
plot_df_list[[ind_idx]] = plot_df
}
plot_df = bind_rows(plot_df_list)
plt = ggplot(
plot_df %>% filter (model != 'Targets'),
aes(x=time_value,
colour=model)
) + geom_line (
aes(y=med,
linetype='median'),
) + scale_color_manual(
values = c(
'#F8766D',
'#00BA38',
'#619CFF')
) + geom_hline(
yintercept = 1,
linetype = 2,
color = "gray"
) + facet_grid(
cols=vars(lead),
rows=vars(sensor_target),
scales='free',
) + coord_cartesian (
ylim = c(0.5, 2.5)
) + labs(
x = "Date",
y = "Median scaled error (relative to strawman)"
) + theme_bw (
) + theme(
legend.position = "bottom",
legend.title = element_blank()
)
print(plt)